Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its modality-aligned setting, i.e ., the
Foreffectiveknowledge transfer,weadopt the idea of domain classifier so that student training is guided by discriminative features invariant totherepresentational space shift between teacher andstudent.
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its modality-aligned setting, i.e ., the